---
id: tutorial-production-evaluation
title: Evaluating LLMs in Production
sidebar_label: Evaluating LLMs in Production
---

## Quick Summary

In the previous section, we set up our medical chatbot for production monitoring and learned how to leverage Confident AI to view and filter responses and traces. Now, it's time to evaluate them. When it comes to **evaluating LLMs in production**, there are 2 key aspects to focus on:

- [Online Evaluations](https://documentation.confident-ai.com)
- [Human-in-the-Loop feedback](https://documentation.confident-ai.com)

Before we begin, first make sure you are logged in to Confident AI:

```python
deepeval login
```

### Online Evaluations

Think of online evaluations as your **first line of defense** in identifying bad or failing responses. While they may not be as valuable as human feedback, they can still be incredibly effective in flagging potentially failing test cases. These flagged cases can then be reviewed by human evaluators, as it’s simply not feasible to manually review every single response logged in production.

:::note
It's important to note that metrics in production are **reference-less metrics**, because they do not have ground truth references (expected output and context).
:::

### Human-in-the-Loop Feedback

Human feedback goes beyond domain experts or dedicated reviewers—it also includes direct input from your users. This kind of feedback is essential for refining your model's performance. We’ll discuss how to collect and leverage user feedback in greater detail in the following sections.

## Setting up Online Evaluations

### OpenAI API key

It's extremely simple to set up online evaluations on Confident AI. Simply navigate to the settings page and input your `OPENAI_API_KEY`. This allows Confident AI to generate evaluation scores using OpenAI models.

:::tip
While Confident AI uses OpenAI models by default, the platform fully supports **custom models** for online evaluations. For more information, feel free to contact us at support@confident-ai.com.
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.us-east-1.amazonaws.com/tutorial_production_eval_01.svg"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

### Turn on your Metrics

Next, navigate to the **Online Evaluations** page and scroll down to view the list of available referenceless metrics. Here, you can toggle metrics on or off, adjust thresholds for each metric, and optionally enable strict mode.

:::info
You can also define **custom metrics in production** with a clear evaluation criteria or set of steps, provided they are referenceless (depends on `input`, `actual_output`, and/or `retrieval_context`)
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_02.png"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

Once the metrics are enabled, all incoming responses will be evaluated automatically. In the next section, we’ll explore how to filter these responses based on metric scores and incorporate human-in-the-loop feedback.

## Human-in-the-Loop Evaluation

### Metric-based Filtering

Notice that in the previous step, we toggled the following metrics: Answer Relevancy, Faithfulness, Bias, and Contextual Relevancy. Let's say we're trying to evaluate how our retriever (RAG engine tool) is performing in production. We'll need to look at all the responses that didn't pass the 0.5 threshold for **Contextual Relevancy**.

:::info  
The observatory allows you to easily **filter for failing responses** based on metric scores, as well as default properties, hyperparameters, and any custom data. Visit [Confident AI](https://app.confident-ai.com/) to try it out now.  
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_03.svg"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

We'll examine this specific response, where our medical chatbot retrieved some information about coughing and pollution and diagnosed the user's condition as a respiratory infection, possibly bronchitis. To understand why contextual relevancy failed despite this response passing every other metric, we'll **click inspect to take a closer look**.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_04.svg"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

### Inspecting Metric Scores

Navigate to the **Metrics** tab in the side panel to view the metric scores in detail. As with any metric in DeepEval, online evaluation metrics are supported by reasoning and detailed logs, enabling anyone reviewing these responses to easily understand why a specific metric is failing and trace the steps through the score calculations.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_05.svg"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

Scrolling down to **Contextual Relevancy**, we see that while the retrieved context is somewhat relevant (scoring 0.45), much of it discusses unrelated topics, primarily focusing on environmental problems, potentially causing the LLM to generate an unclear diagnosis from the lack of relevant information.

:::info
Whether this contextual relevancy failure indicates a need to reduce `chunk_size`, increasing the number of nodes in the retrieval context, or expanding your knowledge base, it's crucial to track this response. We'll achieve this by leaving feedback.
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_06.svg"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

### Leaving Human Feedback

For each response, you'll find an option to leave feedback above the various sub-tabs. For this particular response, let's assign **a rating of 2 stars**, citing the lack of comprehensive context leading to an unclear diagnosis. However, the answer remains relevant, unbiased, and faithful.

:::tip
You may **optionally provide an expected response**, which can be helpful if you plan to add this response to a dataset for further evaluation and testing. We'll explore how to add responses to a dataset in later sections.
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_07.svg"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

You can also leave feedback on entire conversations instead of individual responses. To do this, click on the conversation ID you’re interested in and click **Leave Feedback**, where you'll encounter a familiar interface. As with individual responses, you can also add these monitored conversations to datasets.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_075.svg"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

### Inspecting Human Feedback

All feedback, whether individual or conversational, can be accessed on the **Human Feedback** page. Here, you can filter feedback based on various criteria such as provider, rating, expected response, and more. To add responses to a dataset, simply check the relevant feedback, go to actions, and click add response to dataset.

:::info
These feedback-based filters are also available in the **Observatory**, meaning you can filter for responses based on various human feedback variables.
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_08.svg"
    alt="Feedback Page"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

We've created a dataset called **Failing Responses**, specifically designed for collecting feedback on failing responses. This is the same evaluation dataset referenced in the **Preparing Your Evaluation Dataset** section. By aggregating these failing responses, you can further refine and improve your LLM's performance in future testing environments.

:::tip
It may be helpful to **categorize different types of failing feedback** into separate datasets (e.g., one for retriever failures, another for LLM failures, etc.).
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/tutorial_production_eval_09.svg"
    alt="Failing Responses Dataset"
    style={{
      marginTop: "0px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
      borderBottomLeftRadius: "0px",
      borderBottomRightRadius: "0px",
    }}
  />
</div>

### User Provided Feedback

In addition to leaving feedback from the developer's side, you can also set up your LLM to receive user feedback with just one line of code. Here's how to set it up:

```python
import deepeval

response_id = deepeval.monitor(...)

deepeval.send_feedback(
    response_id=response_id,
    rating=5,
    explanation="...",
    expected_response="..."
)
```

For example, in our medical chatbot, you might want to ask users if they were satisfied with the service. While this feedback may not be as detailed as that provided by your team, it is far more abundant and can offer insights into the areas of your LLM that drive the most value. Here’s how the code might look:

```python
import deepeval
import time

class MedicalAppointmentSystem():
    ...
    def interactive_session(self):
        print("Welcome to the Medical Diagnosis and Booking System!")
        print("Please enter your symptoms or ask about appointment details.")

        while True:
            user_input = input("Your query: ")

            if user_input.lower() == 'exit':
              # Gather user feedback before exiting
              try:
                  satisfied = input("Were you satisfied with the response? (yes/no): ").strip().lower()
                  if satisfied not in ['yes', 'no']:
                      print("Invalid input. Assuming 'no'.")
                      satisfied = 'no'
                  rating = 5 if satisfied == 'yes'
                  explanation = input("Optional: Please provide an explanation (or press Enter to skip): ").strip()

                  deepeval.send_feedback(
                      response_id="place_holder_response_id",
                      rating=rating,
                      explanation=explanation if explanation else "No explanation provided.",
                  )
                  print("Thank you for your feedback!")

              except Exception as e:
                  print(f"An error occurred while sending feedback: {e}")
              break
        ...
```

:::info
**Balancing user satisfaction with the level of detail in feedback** is essential. For instance, while we provide a rating scale from 1 to 5, we simplify it into a binary option: whether the user was satisfied or not.
:::
